Walkthrough: PII Redaction

The PII Redaction SDK lets you use state-of-the-art models to remove personally identifiable information (PII) from text.

!pip install datasets

# This installs the version of spaCy compatible with CUDA 12.x for GPU use.
# To successfully use GPU acceleration, ensure that your system has a compatible CUDA version installed.
# Specific steps for setting up GPU acceleration can be found here: https://spacy.io/usage
!pip install spacy'[cuda12x]'

Import the find_pii function from the dynamofl.privacy library.

import json
from datasets import load_dataset
from dynamofl.privacy import find_pii

1. Redact PII using the `find_pii()` method

The SDK simplifies Personally Identifiable Information (PII) redaction with an easy-to-use function called find_pii().

The find_pii() method returns a dictionary with the following keys:

If the text is of type str:
- redacted_text (str): The redacted string with PII removed.
- redacted_entities (dict): A dictionary keyed by entity type (e.g., 'ORG', 'DATE'). As the sample outputs in this walkthrough show, each value maps a redaction tag (e.g., '[ORG]') to the list of entities it replaced.
- redacted_entity_positions (list of tuples): Positions of redacted entities.
If the text is of type List[str] or Dataset (HuggingFace dataset):
- redacted_dataset (List[str] or Dataset): The list or dataset of strings where PII has been redacted.
- redacted_entities (dict): A dictionary keyed by entity type (e.g., 'ORG', 'DATE'). As the sample outputs in this walkthrough show, each value maps a redaction tag (e.g., '[ORG]') to the list of entities it replaced.

text_with_pii= """DynamoFL which offers software to bring large language models (LLMs)
to enterprises and fine-tune those models on sensitive data, today announced that
it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners."""

# Redact PII with a transformers model
pii_results = find_pii(
    model_type="transformers", 
    text=text_with_pii,
    entity_types=["PERSON", "ORG", "DATE", "MONEY"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results['redacted_text'])
print()
print(json.dumps(pii_results['redacted_entities'], indent=2))

[ORG] which offers software to bring large language models (LLMs) to enterprises and fine-tune those models on sensitive data, [DATE] announced that it raised [MONEY] in a Series A funding round led by [ORG] and [ORG].

{ "ORG": { "[ORG]": [ "Nexus Venture Partners", "Canapi Ventures", "DynamoFL" ] }, "MONEY": { "[MONEY]": [ "$15.1 million" ] }, "DATE": { "[DATE]": [ "today" ] } }
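Because find_pii() returns different keys for a single string than for a list or dataset, a small wrapper can make downstream code simpler. The sketch below is only an illustration: summarize_pii_result and redacted_payload are hypothetical helpers (not part of the SDK), the sample dictionary is hand-written, and the code assumes redacted_entities is nested as entity type, then tag, then list of entities, as in the output above.

```python
from typing import Any, Dict


def summarize_pii_result(result: Dict[str, Any]) -> Dict[str, int]:
    """Count redacted entities per entity type.

    Hypothetical helper, not part of the SDK. Assumes `redacted_entities`
    maps entity type -> {tag: [entities]} as in the sample outputs.
    """
    counts: Dict[str, int] = {}
    for entity_type, by_tag in result.get("redacted_entities", {}).items():
        counts[entity_type] = sum(len(found) for found in by_tag.values())
    return counts


def redacted_payload(result: Dict[str, Any]) -> Any:
    """Return the redacted string, list, or dataset, whichever key is present."""
    if "redacted_text" in result:
        return result["redacted_text"]
    return result["redacted_dataset"]


# Hand-written sample shaped like the output shown in this walkthrough.
sample = {
    "redacted_text": "[ORG] raised [MONEY] [DATE].",
    "redacted_entities": {
        "ORG": {"[ORG]": ["DynamoFL"]},
        "MONEY": {"[MONEY]": ["$15.1 million"]},
        "DATE": {"[DATE]": ["today"]},
    },
}
print(summarize_pii_result(sample))
print(redacted_payload(sample))
```

The same two helpers work on a list or dataset result, since only the payload key differs.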

2. Use any public or custom model

With the SDK, you can use any public or custom transformers, spaCy, flair, or presidio model for PII redaction. This is made possible by the model_config parameter of the find_pii() function. Currently, the PII SDK supports English token classification models. To configure your model, use the following format: {"lang_code": "en", "model": "<your model ID or path>"}

It is worth noting that open-source NER providers do not provide support for a wide range of models. The SDK ensures a consistent experience across all providers (transformers, spacy, flair and presidio), allowing you to incorporate various models into your PII redaction workflow. Please refer to section (3) on how to redact custom entities using your custom models.

Each provider supports the following models:

transformers: Any public token classification model listed here, or any custom model.
spaCy: Any public model listed here, or any custom model.
flair: Any public model listed here, or any custom model.
presidio: Any public model listed here, or any custom model.

text_with_pii = """DynamoFL, which offers software to bring large language models (LLMs)
to enterprises and fine-tune those models on sensitive data, today announced that
it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners."""

pii_results = find_pii(
    model_type="transformers",
    text=text_with_pii,
    entity_types=["ORG"],
    model_config={"lang_code": "en", "model": "dslim/bert-base-NER"},
)

print(pii_results["redacted_text"])
print()
print(json.dumps(pii_results["redacted_entities"], indent=2))

[ORG], which offers software to bring large language models (LLMs) to enterprises and fine-tune those models on sensitive data, today announced that it raised $15.1 million in a Series A funding round led by [ORG] and [ORG].

{ "ORG": { "[ORG]": [ "Nexus Venture Partners", "Canapi Ventures", "DynamoFL" ] } }
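When model IDs come from user input or a config file, it can be convenient to build model_config in one place. The helper below is a minimal sketch under stated assumptions: make_model_config is a hypothetical name, and the English-only check simply mirrors the statement above that only English models are currently supported.

```python
from typing import Dict


def make_model_config(model: str, lang_code: str = "en") -> Dict[str, str]:
    """Build a model_config dict in the documented format.

    Hypothetical helper, not part of the SDK. The language check reflects
    the documented English-only support and is an assumption of this sketch.
    """
    if lang_code != "en":
        raise ValueError(f"unsupported lang_code {lang_code!r}; only 'en' is documented")
    if not model:
        raise ValueError("model must be a non-empty model ID or path")
    return {"lang_code": lang_code, "model": model}


config = make_model_config("dslim/bert-base-NER")
print(config)
```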

3. Redact strings, HuggingFace datasets and custom datasets

The SDK simplifies the process of redacting personally identifiable information (PII) across various data structures. Whether you are dealing with individual strings, HuggingFace datasets, or custom datasets, the find_pii() function provides a straightforward and unified approach to safeguard sensitive information. The examples shown below use transformers, but spacy, presidio, or flair may also be used.

3.1 Redact individual strings

text_with_pii= """DynamoFL which offers software to bring large language models (LLMs)
to enterprises and fine-tune those models on sensitive data, today announced that
it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners."""

pii_results = find_pii(
    model_type="transformers", 
    text=text_with_pii,
    entity_types=["PERSON", "ORG", "DATE", "MONEY"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results['redacted_text'])
print()
print(json.dumps(pii_results['redacted_entities'], indent=2))

[ORG] which offers software to bring large language models (LLMs) to enterprises and fine-tune those models on sensitive data, [DATE] announced that it raised [MONEY] in a Series A funding round led by [ORG] and [ORG].

{ "ORG": { "[ORG]": [ "Nexus Venture Partners", "Canapi Ventures", "DynamoFL" ] }, "MONEY": { "[MONEY]": [ "$15.1 million" ] }, "DATE": { "[DATE]": [ "today" ] } }

3.2 Redact a HuggingFace dataset

dataset = load_dataset("tweet_eval", "stance_climate")

pii_results = find_pii(
    model_type="transformers",
    text=dataset,
    dataset_config={
        "text_column": "text",
        "train_name": "train",
    },
    entity_types = ["CARDINAL", "DATE", "EVENT", "FAC", "GPE", 
        "LANGUAGE", "LAW", "LOC", "MONEY", "NORP", 
        "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT", 
        "QUANTITY", "TIME", "WORK_OF_ART"
    ],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)

# Display the keys as the entity dictionary is large
print(pii_results["redacted_entities"].keys())

dict_keys(['PERSON', 'GPE', 'ORG', 'TIME', 'EVENT', 'CARDINAL', 'QUANTITY', 'LOC', 'NORP', 'DATE', 'ORDINAL', 'WORK_OF_ART', 'FAC', 'PRODUCT', 'MONEY', 'LAW', 'PERCENT'])

3.3 Redact a custom dataset

The PII SDK accepts custom datasets as a list of strings. This is done for generalizability.

custom_dataset = [
    """
        DynamoFL, which offers software to bring large language models (LLMs) to
        enterprises and fine-tune those models on sensitive data, today announced
        that it raised $15.1 million in a Series A funding round co-led by
        Canapi Ventures and Nexus Venture Partners.
    """,
    """
        The tranche, with had participation from Formus Capital and Soma Capital,
        brings DynamoFL’s total raised to $19.3 million. Co-founder and CEO Vaikkunth Mugunthan
        says that the proceeds will be put toward expanding DynamoFL’s product
        offerings and growing its team of privacy researchers.
    """,
    """
        Taken together, DynamoFL’s product offering allows enterprises to develop private
        and compliant LLM solutions without compromising on performance,” Mugunthan told
        TechCrunch in an email interview.
    """,
]

pii_results = find_pii(
    model_type="transformers",
    text=custom_dataset,
    entity_types=["PERSON", "ORG", "DATE", "MONEY"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(json.dumps(pii_results["redacted_entities"], indent=2))

{ "ORG": { "[ORG]": [ "Nexus Venture Partners", "Canapi Ventures", "DynamoFL", "Soma Capital", "Formus Capital", "TechCrunch" ] }, "MONEY": { "[MONEY]": [ " $15.1", "$ 19.3" ] }, "DATE": { "[DATE]": [ "today" ] }, "PERSON": { "[PERSON]": [ "Vaikkunth Mugunthan", "Mugunthan" ] } }
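Data that is not already a list of strings can be flattened into one before it is passed to find_pii(). The sketch below assumes a CSV source with a hypothetical comment column; csv_column_to_texts is an illustrative helper, not an SDK function.

```python
import csv
import io
from typing import List


def csv_column_to_texts(csv_text: str, column: str) -> List[str]:
    """Extract one column of a CSV document as a list of non-empty strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    texts: List[str] = []
    for row in reader:
        # Missing or empty cells are skipped rather than passed on as "".
        value = (row.get(column) or "").strip()
        if value:
            texts.append(value)
    return texts


# Hypothetical in-memory CSV; in practice this would be read from a file.
raw_csv = "id,comment\n1,Call Alice at noon\n2,\n3,Invoice sent to Bob\n"
custom_dataset = csv_column_to_texts(raw_csv, "comment")
print(custom_dataset)
```

The resulting list can then be passed as the text argument, exactly like the hand-written list above.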

4. Supported entity types

The PII Redaction SDK supports the following pre-defined entity types for English models:

transformers: Entity types to be redacted must be specified using the entity_types parameter in the find_pii() function. dynamofl-sandbox/pii-roberta-large supports the following 18 entity classes: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
spaCy: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
flair: Flair models offer two pre-defined entity configurations:
- 4 class: PER, ORG, LOC, MISC
- 18 class: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
presidio (uses spacy models under the hood): CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART

NOTE 1: Optionally, a subset can be specified using the entity_types parameter in the find_pii() function. As stated above, this parameter must be specified when using the transformers provider.

NOTE 2: DynamoFL recommends using dynamofl-sandbox/pii-roberta-large for redacting sensitive information.

text_with_pii = """DynamoFL, which offers software to bring large language models (LLMs)
to enterprises and fine-tune those models on sensitive data, today announced that
it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners."""

pii_results = find_pii(
    model_type="transformers",
    text=text_with_pii,
    entity_types=["DATE"], # Redact only the date entity; use a subset of the entity types supported by the model
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)

print(pii_results["redacted_text"])
print()
print(pii_results["redacted_entities"])

DynamoFL, which offers software to bring large language models (LLMs) to enterprises and fine-tune those models on sensitive data, [DATE] announced that it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners.

{'DATE': {'[DATE]': ['today']}}
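Since the available entity types differ between the 18-class and 4-class configurations, it can help to check a requested entity_types list before running a long redaction job. The following is a minimal sketch: unsupported_entity_types is a hypothetical helper, and the two sets simply restate the class lists given in this section.

```python
from typing import FrozenSet, Iterable, List

# The 18 classes listed in this section.
EIGHTEEN_CLASS: FrozenSet[str] = frozenset({
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
})
# The 4 classes of the smaller flair configuration.
FOUR_CLASS: FrozenSet[str] = frozenset({"PER", "ORG", "LOC", "MISC"})


def unsupported_entity_types(
    requested: Iterable[str],
    supported: Iterable[str],
    custom: Iterable[str] = (),
) -> List[str]:
    """Return requested types that are neither model-supported nor custom-defined."""
    known = set(supported) | set(custom)
    return sorted(t for t in requested if t not in known)


print(unsupported_entity_types(["PERSON", "ORG", "EMAIL"], EIGHTEEN_CLASS))
print(unsupported_entity_types(["PERSON", "ORG"], FOUR_CLASS))
```

The custom argument leaves room for user-defined entity types, which are not part of any model's label set.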

5. Regex and exact match support

The PII Redaction SDK also supports regex entity types and exact match entity types with Transformers and Presidio. To do so, use the custom_entity_config parameter. Set it to a configuration dictionary, including all the regex and exact match entity types to tag. This support will be extended to spaCy and flair soon!

The custom_entity_config is a nested dictionary containing the following keys:

entity_type: This signifies the type of custom entity and serves as the key.
recognizer_type: This indicates the type of recognizer, which can be 'regex' or 'deny-list'.
deny_list: This is a list of patterns to be added to the deny-list if the 'deny-list' recognizer is used.
regex: This is a regular expression pattern to be used if the 'regex' recognizer is chosen.

dataset = load_dataset("tweet_eval", "stance_climate")

# Define parameters for custom entity types
custom_entity_config = {
    "CLIMATE_PII": {
        "recognizer_type": "deny-list",
        "deny_list": ["#SemST", "#environment", "#COP21"],
    },
    "MENTION": {
        "recognizer_type": "regex",
        "regex": r"(@\w+\s*)+",
        "score": 1.0,
    },
}

pii_results = find_pii(
    model_type="transformers",
    text=dataset,
    dataset_config={
        "text_column": "text",
        "train_name": "train",
    },
    # WORK_OF_ART is a pre-defined entity type; if the model finds none, it is simply absent from the results
    entity_types=["MENTION", "CLIMATE_PII", "WORK_OF_ART"], 
    custom_entity_config=custom_entity_config,
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(json.dumps(pii_results["redacted_entities"], indent=2))

{ "CLIMATE_PII": { "[CLIMATE_PII]": [ "#SemST", "#environment", "#COP21" ] }, "MENTION": { "[MENTION]": [ "@user ", "@user @user ", "@user @user @user @user @user ", "@user @user @user @user ", "@TonyAbbottMHR ", "@msimire ", "@user @user @user ", "@user @user @user @user ", "@solarimpulse ", "@CreeClayton ", "@RobSilver ", "@potus ", "@ClimatParis2015", "@quinn43 ", "@MexONU ", "@user @user ", "@user ", "@BlissTabitha ", "@user @user " ] }, "WORK_OF_ART": { "[WORK_OF_ART]": [ "#ChasingIce", "\"The Whale and the Supercomputer. On the Northern Front of Climate Change", "The Biggest Story In World Podcast", "TheBachelorette", "CaptainPlanet", "bible", "futurama", "\"Crimes of the Hot\"", "Futurama" ] } }
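To make the two recognizer types concrete, the sketch below locates regex and deny-list entities using only Python's standard re module. It is not the SDK's implementation: find_custom_spans is a hypothetical function, it reads only the recognizer_type, regex, and deny_list keys, and the sample tweet is invented.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)


def find_custom_spans(text: str, config: Dict[str, dict]) -> List[Span]:
    """Locate custom entities with 're' only (illustrative, not the SDK's internals)."""
    spans: List[Span] = []
    for entity_type, spec in config.items():
        if spec["recognizer_type"] == "regex":
            pattern = re.compile(spec["regex"])
        elif spec["recognizer_type"] == "deny-list":
            if not spec["deny_list"]:
                continue  # an empty alternation would match everywhere
            # Escape each term so characters like '#' or '.' match literally.
            pattern = re.compile("|".join(re.escape(t) for t in spec["deny_list"]))
        else:
            raise ValueError(f"unknown recognizer_type: {spec['recognizer_type']!r}")
        spans.extend((m.start(), m.end(), entity_type) for m in pattern.finditer(text))
    return sorted(spans)


config = {
    "CLIMATE_PII": {"recognizer_type": "deny-list", "deny_list": ["#SemST", "#COP21"]},
    "MENTION": {"recognizer_type": "regex", "regex": r"@\w+"},
}
tweet = "@user see you at #COP21"
for start, end, entity_type in find_custom_spans(tweet, config):
    print(entity_type, repr(tweet[start:end]))
```

A deny-list is effectively an exact-match search, which is why each term is escaped before being compiled into a pattern.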

6. PII Post-processing

The PII SDK includes a post-processing feature that can be integrated into your workflow after personally identifiable information (PII) has been detected. It takes a user-defined function that dynamically chooses the replacement text for each identified entity, and it can be applied to both pre-defined and custom entities. The example shown below uses transformers to dynamically redact a regex entity type.

import re

text_with_age = "Jack is 9. John is 26. Jill is 18."

# Create a callback for processing age mentions
def age_custom_redacted_text(match):
    age = int(match)
    redacted_tag = ""
    if age < 10:
        redacted_tag = "[<10]"
    elif 10 <= age <= 20:
        redacted_tag = "[10-20]"
    else:
        redacted_tag = "[>20]"
    return redacted_tag

custom_entity_config = {
    "AGE": {
        "recognizer_type": "regex",
        "regex": r"\b\d{1,2}\b",
        "score": 1.0,
        "redacted_text_callback": age_custom_redacted_text,
    },
}

pii_results = find_pii(
    model_type="transformers",
    text=text_with_age,
    entity_types=["AGE", "PERSON"],
    custom_entity_config=custom_entity_config,
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results['redacted_text'])
print()
print(json.dumps(pii_results['redacted_entities'], indent=2))

[PERSON] is [<10]. [PERSON] is [>20]. [PERSON] is [10-20].

{ "AGE": { "[10-20]": [ "18" ], "[>20]": [ "26" ], "[<10]": [ "9" ] }, "PERSON": { "[PERSON]": [ "Jill", "John", "Jack" ] } }
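The callback mechanism can be pictured with plain re.sub, which also accepts a function that computes each replacement. This is a standalone sketch rather than the SDK's code path; redact_with_callback and age_bucket are illustrative names, and the callback is assumed to receive the matched text as a string.

```python
import re
from typing import Callable


def age_bucket(age_text: str) -> str:
    """Map a matched age to a coarse bucket tag."""
    age = int(age_text)
    if age < 10:
        return "[<10]"
    if age <= 20:
        return "[10-20]"
    return "[>20]"


def redact_with_callback(text: str, pattern: str, callback: Callable[[str], str]) -> str:
    """Replace each regex match with callback(matched_text)."""
    return re.sub(pattern, lambda m: callback(m.group(0)), text)


print(redact_with_callback("Jack is 9. John is 26. Jill is 18.", r"\b\d{1,2}\b", age_bucket))
```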

7. Unique anonymization

The SDK's "unique anonymization" feature assigns consistent identifiers to the entities found within a document: identical entities across different parts of the text share the same unique identifier.

text_with_repeated_entities = "John Doe lives in New York. James Doe recently moved to the same city."

'''
result_flair = find_pii(model_type="flair", text=text_with_repeated_entities, unique_anonymization=True)
result_spacy = find_pii(model_type="spacy", text=text_with_repeated_entities, unique_anonymization=True)
result_presidio = find_pii(model_type="presidio", text=text_with_repeated_entities, unique_anonymization=True)
'''

pii_results = find_pii(
    model_type="transformers", 
    text=text_with_repeated_entities, 
    entity_types=["PERSON", "GPE"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"}, 
    unique_anonymization=True
)

print(pii_results['redacted_text'])
print()
print(json.dumps(pii_results['redacted_entities'], indent=2))

[PERSON_1] lives in [GPE_1]. [PERSON_2] recently moved to the same city.

{ "PERSON": { "[PERSON_2]": [ "James Doe" ], "[PERSON_1]": [ "John Doe" ] }, "GPE": { "[GPE_1]": [ "New York" ] } }

8. Combinations of unique and non-unique anonymization

The PII SDK offers a versatile feature that allows the combination of unique and non-unique anonymization strategies. This feature is particularly useful when dealing with various types of personally identifiable information (PII) within a dataset.

Unique anonymization: Each distinct PII entity is replaced with its own numbered identifier (e.g., [PERSON_1], [PERSON_2]); repeated mentions of the same entity share that identifier.

Non-unique anonymization: Every entity of a given type is replaced with the same identifier (e.g., [ORG]).

Combinations of anonymization strategies: Both strategies can be combined within a single redaction process. Pass a list of entity types to the unique_anonymization parameter to apply unique anonymization to those types only; all other types are anonymized non-uniquely.

text_with_pii = "Google is a multinational technology company, its CEO is Sundar Pichai. Apple is a global technology company, its CEO is Tim Cook."

pii_results = find_pii(
    model_type="transformers",
    text=text_with_pii,
    entity_types=["ORG", "PERSON"],
    unique_anonymization=["PERSON"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results['redacted_text'])
print()
print(json.dumps(pii_results['redacted_entities'], indent=2))

[ORG] is a multinational technology company, its CEO is [PERSON_1]. [ORG] is a global technology company, its CEO is [PERSON_2].

{ "PERSON": { "[PERSON_2]": [ "Tim Cook" ], "[PERSON_1]": [ "Sundar Pichai" ] }, "ORG": { "[ORG]": [ "Apple", "Google" ] } }
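The tag assignment behind unique and non-unique anonymization can be sketched in a few lines. The code below is an illustration only, not the SDK's implementation: assign_tags is a hypothetical function, mentions are hand-written (entity type, text) pairs, and numbering by order of first appearance is an assumption.

```python
from typing import Dict, List, Sequence, Tuple, Union


def assign_tags(
    mentions: Sequence[Tuple[str, str]],
    unique_anonymization: Union[bool, Sequence[str]] = False,
) -> List[str]:
    """Return one tag per mention.

    Types selected by `unique_anonymization` get numbered tags that are
    reused when the same text reappears; all other types share one tag.
    """
    def is_unique(entity_type: str) -> bool:
        if isinstance(unique_anonymization, bool):
            return unique_anonymization
        return entity_type in unique_anonymization

    counters: Dict[str, int] = {}
    assigned: Dict[Tuple[str, str], str] = {}
    tags: List[str] = []
    for entity_type, text in mentions:
        if not is_unique(entity_type):
            tags.append(f"[{entity_type}]")
            continue
        key = (entity_type, text)
        if key not in assigned:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            assigned[key] = f"[{entity_type}_{counters[entity_type]}]"
        tags.append(assigned[key])
    return tags


mentions = [
    ("ORG", "Google"), ("PERSON", "Sundar Pichai"),
    ("ORG", "Apple"), ("PERSON", "Tim Cook"), ("PERSON", "Sundar Pichai"),
]
print(assign_tags(mentions, unique_anonymization=["PERSON"]))
```

Passing True numbers every type, while the default leaves every type with its shared tag.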

9. Identify but prevent redaction of certain types of PII

The PII SDK lets you identify specific types of personally identifiable information (PII) while excluding them from redaction. This is useful when certain PII elements must be retained in their original form, and it is done through the no_redact parameter. The example shown below uses transformers.

NOTE: The no_redact parameter can be combined with the unique_anonymization parameter or can be applied to custom entity types, enabling complex PII redaction.

text_with_pii = """DynamoFL which offers software to bring large language models (LLMs)
to enterprises and fine-tune those models on sensitive data, today announced that
it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners."""

pii_results = find_pii(
    model_type="transformers",
    text=text_with_pii,
    no_redact=["MONEY"],
    entity_types=["PERSON", "ORG", "DATE", "MONEY"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results['redacted_text'])
print()
print(pii_results['redacted_entities'])

[ORG] which offers software to bring large language models (LLMs) to enterprises and fine-tune those models on sensitive data, [DATE] announced that it raised $15.1 million in a Series A funding round led by [ORG] and [ORG].

{'ORG': {'[ORG]': ['Nexus Venture Partners', 'Canapi Ventures', 'DynamoFL']}, 'MONEY': {' $15.1 million': ['$ 15.1 million']}, 'DATE': {'[DATE]': ['today']}}

10. Enhanced redaction speed

Our SDK speeds up redaction by employing an algorithm with a time complexity of O(n log L), where:

'n': Represents the number of personally identifiable information (PII) instances.
'L': Denotes the average length of PII.

It's essential to highlight that neither spaCy nor Flair offers native support for redaction.

SpaCy can only determine the positions of PII in the original text; it lacks the capability to redact the text.

From the spaCy documentation:

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

From the flair documentation:

Entities in this case are Span objects that have a number of fields you can access, such as .text. You can also iterate through all tokens of a span and access their text, idx and other fields:

from flair.data import Sentence
from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner')
sentence = Sentence('George Washington went to Washington .')
tagger.predict(sentence)
for entity in sentence.get_spans('ner'):
    # print entity
    print(entity)

    # print only the entity text
    print(entity.text)

    # go through each token in entity and print its idx
    for token in entity:
        print(token.idx)
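Since both libraries stop at entity positions, turning those positions into redacted text is left to the caller. The sketch below shows one simple way to do it in a single left-to-right pass. It is not the SDK's algorithm and makes no claim about its complexity: redact_spans is a hypothetical function, and the spans are hand-written stand-ins for what an NER library would report (for example spaCy's ent.start_char, ent.end_char, and ent.label_).

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def redact_spans(text: str, spans: Sequence[Span]) -> str:
    """Replace each non-overlapping character span with a [LABEL] tag."""
    pieces: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(f"[{label}]")        # the tag that replaces the entity
        cursor = end
    pieces.append(text[cursor:])           # remainder after the last entity
    return "".join(pieces)


sentence = "George Washington went to Washington."
spans = [(0, 17, "PER"), (26, 36, "LOC")]
print(redact_spans(sentence, spans))
```

Collecting the pieces and joining once avoids rebuilding the string for every entity, and it keeps the original offsets valid throughout the pass.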
